Abstract

Background: Identification of DNA/protein motifs is a crucial problem for biologists. Computational techniques could be of
great help in this identification. In this direction, many computational models for motifs have been proposed in the
literature.

Methods: One such important model is the (l,d)-motif model. In this paper we describe a motif search web tool that
predominantly employs this motif model. This web tool exploits state-of-the-art algorithms for solving the (l,d)-motif
search problem.

Results: The online tool has been helping scientists identify many unknown motifs. Many of our predictions have been 
successfully verified as well. We hope that this paper will expose this crucial tool to many more scientists. 
Introduction

Motif search is an important problem in biology. Computational
techniques could greatly help in solving this problem. A
number of computational motif search tools can be found in the
literature; see, e.g., PRATT [1], MEME [2], DILIMOT [3],
SLiMDisc [4], SLiMFinder [5] and FIRE-pro [6].

Each of the above tools is based on a specific model of motif
search. An important model for motifs is the (l,d)-motif search
model. A simple version of this model can be stated as follows. We
are given n input sequences s_1, s_2, ..., s_n, each of length m,
and two integers l and d. The problem is to find a motif M
that is present in the n input sequences. It is known that M is of
length l and that it occurs in each of the n input sequences within a
Hamming distance of d.

This model has been shown to yield better sensitivities than
other models when tested on known biological data (see e.g.,
[7]). The problem of (l,d)-motif search is intractable [8].
Numerous algorithms have been proposed for solving the
(l,d)-motif search problem. Examples are RISO and RISOTTO
[9]. But RISO and RISOTTO are downloadable programs and
there are no corresponding web systems. In this paper we describe
a web system for motif search that uses the (l,d)-motif model. Our
web system has the following features: 1) We employ several
state-of-the-art algorithms for (l,d)-motif search. We can identify longer
motifs than RISO and RISOTTO. RISO can only identify motifs
of length up to 14. PMS can identify motifs of length up to 23; 2)
Both DNA and protein motifs are supported; 3) We support
quorum motif search. In this case the motif(s) need not occur in all
the input sequences. Quorum motif search is significantly more
difficult than the regular version [10]; 4) Dyad motifs are also
found. In particular, a dyad motif consists of
two segments separated by a gap; 5) We employ a scoring
mechanism for the putative motifs found; 6) The user interface
for PMS is very friendly; and 7) In PMS, user emails are optional.

To the best of our knowledge, there is no other comprehensive
motif search system, based on the (l,d)-motif model, comparable
to ours.

Results 

The PMS Webserver 

The PMS server is freely available at http://pms.engr.uconn.edu
or at http://motifsearch.com. The website is open to any user.
Login is not required. However, any user with a login account will
have the benefit of viewing and retrieving his or her submission
history. Also, a submission associated with a registered user will be
kept in the system indefinitely unless the user deletes it. Any submission
from a user without a login account will be stored in the system for
one month and then removed automatically.

The purpose of the motif search tool is to help biologists identify
novel motifs that may be present in input DNA and/or protein
sequences. Simple and user-friendly input forms allow users to
submit queries easily and quickly. Informative output and
visualizations permit users to analyze the results carefully.
These features of the website are described in more detail in the
following sections.
Input Sequences and Parameters

The input data can be either DNA or protein sequences. The 
length of each sequence is required to be between 15 and 1000. 
The number of input sequences is required to be between 5 and 
500. The input sequences should be organized in the well-known 
text-based format - FASTA. 
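
The input constraints above can be checked before a job is submitted. The following is a minimal sketch, not the server's actual validation code: the function names `parse_fasta` and `validate_input` are hypothetical, and the limits are simply the ones stated in this section.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def validate_input(records, min_len=15, max_len=1000, min_seqs=5, max_seqs=500):
    """Return a list of problems; an empty list means the input passes."""
    problems = []
    if not (min_seqs <= len(records) <= max_seqs):
        problems.append(
            f"expected {min_seqs}-{max_seqs} sequences, got {len(records)}"
        )
    for header, seq in records:
        if not (min_len <= len(seq) <= max_len):
            problems.append(
                f"{header}: length {len(seq)} outside {min_len}-{max_len}"
            )
    return problems
```

A dataset that yields an empty problem list satisfies the stated limits on sequence count and sequence length; the sketch does not check the alphabet of the sequences.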

For each input dataset, a set of parameters will be chosen by the
user. These parameters are shown in Figure 1. The first parameter
is called "quorum percent", which is the minimum percentage of the
input sequences that contain motifs. Quorum percent is set to 75%
by default.

The second parameter allows users to choose the structure of 
motifs. Currently, the tool considers two structures, namely, 
monads and dyads. A monad is a contiguous string and a dyad 
consists of two segments separated by a gap. A monad is assumed 
by default. For monads, the users will choose the motif length. By 
default, the motif length is chosen to be "Any" which means that 
the tool will search for motifs of lengths between 10 and 25. If 
information about the motif length is known, we recommend that 
it be used to reduce the processing time. For dyads, users should 
choose the length of the first segment or box, the length of the 
second box and the length of the gap between the two boxes. If the 
lengths are chosen to be "Any", processing will proceed similar to 
that for monads. 

The third parameter applies to DNA sequences and gives users the
option of considering the reverse complement sequences.
If the input DNA sequences have the same orientation, the third
parameter should be chosen to be "No". Otherwise, we
recommend that it be chosen to be "Yes".
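
For readers unfamiliar with the term, the reverse complement of a DNA string can be computed as in the short sketch below. The function name is hypothetical, and the text does not state how the server combines the reverse complements with the original sequences, so only the string operation itself is shown.

```python
# Watson-Crick pairing for the four DNA characters.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]
```

For example, `reverse_complement("AAGC")` yields `"GCTT"`.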

Submitting Jobs 

After entering the sequences and relevant parameters, the user
clicks on the "Submit" button on the submission form. If the data
entered are valid, the submission will enter the processing queue.
Once the processing is over, a results page will be displayed.
Information about the submission will appear on top of the results
page as shown in Figure 2. Users can update the contact email or
change the parameters by clicking on either the "Update" button
or the "Change parameters" button, respectively.

After submission, the submission status could be one of these: in
processing queue, being processed, and processed. If the submission
has not been processed yet, the bottom of the results page will
appear as shown in Figure 3. Users can click on the "Refresh"
button to update the processing status. Users can either wait for
their submission to be processed or bookmark the results page and
return to it later. If the contact email is provided, the system will
send a notification email when the submission is processed. The
notification email will include the URL for the results page.

The processing time of any submission varies from a few
minutes to a few hours, depending on the data, the parameters,
and the workload of the server. If the user feels that the tool is
taking too much time to process, we recommend that (s)he provide
his/her contact email. Providing an email has a number of benefits.
The first benefit is that the user will receive an email notification
when query processing is complete. The second benefit is that the
user's submissions will be stored in the system as long as needed. The
third and perhaps the most important benefit is that the user can
retrieve the submission history (as discussed in the next section).

Output 

Once the submission is processed, the bottom of the results page
will appear as shown in Figure 4. Identified candidate motif(s) will
appear on the left and the input sequences will appear on the right.
If no motifs are found, we recommend reducing the value of the
quorum percent.

Figure 1. Parameters for DNA sequences. The set of required parameters for DNA sequences. The first parameter is "quorum percent" which is
the minimum percentage of the input sequences containing motifs. The second parameter allows users to choose the structure of motifs.
doi:10.1371/journal.pone.0080660.g001

Figure 2. Query information. Information about submission. Users can click on the "Update" button or the "Change parameters" button to
update the contact email or change parameters.
doi:10.1371/journal.pone.0080660.g002

The candidate motif(s) found are ranked according to their
scores. The score of a candidate motif is the logarithm of the
probability that the motif occurs by random chance. The smaller
the score, the more biologically significant the motif is. For more
details on the scoring scheme, the readers are referred to [10]. For
each candidate motif, users can click on the "View motif
locations" button corresponding to the motif in order to view its
locations, i.e., its instances, in the input sequences. The locations of
the motif instances will be highlighted in the input sequences as
shown in Figure 4. The probability weight matrix of the motif is
calculated directly from its motif instances and will appear
above the input sequences. The probability of a DNA character at
each column in the probability weight matrix is its frequency when
the motif instances are aligned. When a motif is chosen, users can
click on the "Save motif locations in text" button to save its
locations in a text file.
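
The column-frequency computation described above can be sketched as follows. This is an illustrative reading of the text, not the server's code: the function name `probability_matrix` is hypothetical, and the sketch assumes all instances have the same length and use the DNA alphabet.

```python
from collections import Counter


def probability_matrix(instances, alphabet="ACGT"):
    """Per-column character frequencies of equal-length aligned instances."""
    if not instances:
        raise ValueError("at least one motif instance is required")
    length = len(instances[0])
    if any(len(inst) != length for inst in instances):
        raise ValueError("all motif instances must have the same length")
    matrix = []
    # zip(*instances) yields one tuple of characters per aligned column.
    for column in zip(*instances):
        counts = Counter(column)
        matrix.append({ch: counts[ch] / len(instances) for ch in alphabet})
    return matrix
```

Each entry of the returned list corresponds to one motif position and maps every character of the alphabet to its observed frequency at that position.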

For input protein sequences, the results page is shown in Figure 5.
It is similar to that for DNA sequences, except that the
probability weight matrix is not shown because it would be large
for protein sequences.

Submissions History 

The website allows users to easily manage their submission(s) 
history. To start the submissions history feature, click on the link 
"Submission history" on the left menu of the website. To view 
submissions history, enter the contact email and password on the 
submissions history form. If the password has not been set by a 
user yet, (s)he can go to the reset password form and enter the 
contact email. An email will be sent to the contact email including 
a URL that allows the user to reset the password. 

The list of submissions will be shown as in Figure 6. Users can 
sort their submissions based on query ID, submission time, or 
processing status. If the users want to view a particular submission, 
they can click on the link "View detail" of the corresponding 
submission. 



Feedback 

The website supports an extensive feedback section. Users can
easily submit feedback, comments, and questions using the
feedback form. Feedback and comments will help us improve the
website. To access the feedback form, click on the link "Feedback"
on the left menu of the website.

Discussion 

In this paper we have described a new web tool for motif search
called PMS. This tool is based on the (l,d)-motif search model.
This is a comprehensive web tool offering many crucial features
and we are not aware of any other computational motif search tool
comparable to ours. In the future we plan to support additional
features. For example, we will identify candidate motifs with more
than two segments (separated by gaps). Another important feature
will be to score the candidate motifs based on publicly available
experimental data. User feedback will also be taken into account
in enhancing the features of our web tool PMS. We also plan to
incorporate other motif models in the future. In addition, we plan to
work on finding longer motifs.

Materials and Methods 

Our online motif search tool is built on state-of-the-art
algorithms for the most well-known motif model: (l,d)-motif
search, or Planted Motif Search (PMS). The PMS model has
been shown to be very effective in identifying motifs (see e.g., [7]).
The PMS Problem is defined as follows.



Figure 3. Result not available. An example of the results page when the submission has not been processed yet. Users can click on the "Refresh"
button to update the processing status.
doi:10.1371/journal.pone.0080660.g003

Figure 4. Result available - DNA sequences. An example of the results page when the submission is processed. The locations of the second
motif are marked on the DNA input sequences.
doi:10.1371/journal.pone.0080660.g004



Figure 5. Result available - Protein sequences. An example of the results page when the submission is processed. The locations of the second
motif are marked on the protein input sequences.
doi:10.1371/journal.pone.0080660.g005

Figure 6. Submission history. An example of the submissions history. Users can sort their submissions based on query ID, submission time, or
processing status. Users can view a particular submission in detail by clicking on the link "View detail" of the corresponding submission.
doi:10.1371/journal.pone.0080660.g006



Definition 0.1 PMS Problem: given n sequences and integer
parameters l, d and q, find all strings M of length l such that M
appears in at least q out of the n given sequences within d mutations.
Each such string M is a putative motif. Any l-mer (i.e., a substring of
length l) L in any input string such that the Hamming distance between
M and L is at most d is known as an instance of the motif M.
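
Definition 0.1 translates directly into a checking procedure. The sketch below is only an illustration of the definition (the function names are hypothetical); it tests a single given string M and is not a search algorithm.

```python
def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif, sequence, d):
    """True if some l-mer of sequence is within distance d of motif."""
    l = len(motif)
    return any(
        hamming(motif, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def is_motif(motif, sequences, d, q):
    """True if motif has an instance in at least q of the sequences."""
    return sum(occurs_within(motif, s, d) for s in sequences) >= q
```

For instance, with the sequences `AAACGTAA`, `TTACTTTT` and `GGGGGGGG`, the string `ACGT` has an instance within one mutation in the first two sequences only, so it qualifies for d = 1 and q = 2 but not for q = 3.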

The PMS Algorithms 

In our web tool, we have used a combination of the current best
PMS algorithms proposed in [10], [11], and [12].

We now summarize some of the techniques used in these
algorithms.

Let HD(x,y) stand for the Hamming distance between two strings
x and y of the same length. Let s_1, s_2, ..., s_n be the given set of input
sequences, each of length m. For simplicity, consider the version where
q = n. The PMS0 algorithm works as follows [13]: Consider s_1. Let L
be an l-mer of s_1. Define the d-neighborhood B_d(L) of L to be the
collection of all the l-mers q such that HD(L,q) ≤ d. If L is an instance
of an (l,d)-motif M, then, clearly, M will be in B_d(L). However, we do
not know which l-mers of s_1 are instances of the motif we are looking
for. Thus, PMS0 constructs B_d(L) for every l-mer L in s_1. It then
forms the union C_1 of all of these d-neighborhoods. C_1 contains all
the (l,d)-motifs. For each l-mer L in C_1, the algorithm checks whether L is an
(l,d)-motif in an obvious manner. Note that for a given l-mer L,
we can check whether it is an (l,d)-motif in O(mnl) time. A variation of this
algorithm is called PMS1 and is described below [13]:

Algorithm PMS1

1. Compute C_i for each input sequence s_i, 1 ≤ i ≤ n. Here
C_i = ∪_{L ∈ s_i} B_d(L). In other words, C_i is nothing but the union
of the d-neighborhoods of all the l-mers in s_i, 1 ≤ i ≤ n. The
notation L ∈ s_i indicates that the l-mer L is a substring of s_i.

2. The (l,d)-motifs are now computed as ∩_{i=1}^{n} C_i.
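
The two steps of PMS1 can be sketched in a few lines of Python. This is a brute-force illustration for small l and d under the q = n assumption, not the implementation used by the web tool: the neighborhood is enumerated explicitly and Python sets stand in for whatever data structures the published algorithm uses. The function names and the default DNA alphabet are assumptions.

```python
from itertools import combinations, product


def neighborhood(lmer, d, alphabet="ACGT"):
    """B_d(lmer): all strings within Hamming distance d of lmer."""
    result = set()
    for k in range(d + 1):
        for positions in combinations(range(len(lmer)), k):
            for letters in product(alphabet, repeat=k):
                candidate = list(lmer)
                for pos, ch in zip(positions, letters):
                    candidate[pos] = ch
                # Substituting a letter by itself only lowers the distance,
                # so every candidate stays within distance d.
                result.add("".join(candidate))
    return result


def pms1(sequences, l, d, alphabet="ACGT"):
    """Intersect the neighborhood unions C_i of all input sequences."""
    motifs = None
    for s in sequences:
        # Step 1: C_i is the union of B_d(L) over all l-mers L of s_i.
        c_i = set()
        for start in range(len(s) - l + 1):
            c_i |= neighborhood(s[start:start + l], d, alphabet)
        # Step 2: keep only strings present in every C_i seen so far.
        motifs = c_i if motifs is None else motifs & c_i
    return motifs or set()
```

With the toy input `["ACGT", "ACGA", "TCGT"]`, l = 4 and d = 1, the only string within one mutation of all three sequences is `ACGT`.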

Algorithm PMS5 can be thought of as an extension of PMS0
[11]. If S is a collection of strings, let M_{l,d}(S) denote the (l,d)-
motifs present in S. If the input sequences are s_1, s_2, ..., s_n, let
S = {s_1, s_2, ..., s_n} and let S' = {s_2, s_3, ..., s_n}. The idea of PMS5 is
to compute the (l,d)-motifs of S as ∪_{L ∈ s_1} M_{l,d}(L, S').

In order to compute M_{l,d}(L, S') for any l-mer L, the algorithm
uses a subroutine to compute the common d-neighborhood of
three l-mers. Specifically, let x, y, z be any three l-mers. We use
B_d(x,y,z) to denote the common d-neighborhood of x, y, and z. In
other words, B_d(x,y,z) is nothing but the set of all l-mers that are
at a distance of no more than d from each of the three l-mers x, y,
and z.
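
To make the definition of B_d(x,y,z) concrete, the sketch below computes it by brute force: it enumerates B_d(x) and keeps the strings that are also within distance d of y and z. This is only meant to pin down the definition; it is not the pruned tree traversal that PMS5 uses, and the function names and the DNA alphabet are assumptions.

```python
from itertools import combinations, product


def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(p != q for p, q in zip(a, b))


def neighborhood(lmer, d, alphabet="ACGT"):
    """B_d(lmer): all strings within Hamming distance d of lmer."""
    result = set()
    for k in range(d + 1):
        for positions in combinations(range(len(lmer)), k):
            for letters in product(alphabet, repeat=k):
                candidate = list(lmer)
                for pos, ch in zip(positions, letters):
                    candidate[pos] = ch
                result.add("".join(candidate))
    return result


def common_neighborhood(x, y, z, d, alphabet="ACGT"):
    """B_d(x,y,z): strings within distance d of each of x, y and z."""
    return {
        t for t in neighborhood(x, d, alphabet)
        if hamming(t, y) <= d and hamming(t, z) <= d
    }
```

For x = `AAAA`, y = `AAAT`, z = `AATA` and d = 1, the common neighborhood contains only `AAAA`.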

To compute B_d(x,y,z), PMS5 represents B_d(x) as a tree T_d(x).
Each node in this tree is an l-mer in B_d(x). The root of T_d(x) is
the l-mer x. The depth of T_d(x) is d. T_d(x) is traversed in a
depth-first manner. Let t be any node in this tree. During the
traversal, t will be output if t is in B_d(y) ∩ B_d(z). While visiting any
node t, we check if there is a descendant t' of t such that t' is in
B_d(y) ∩ B_d(z). The subtree rooted at t will be pruned if there is no
such descendant. The problem of checking if t has any descendant
that is in B_d(y) ∩ B_d(z) is formulated as an integer linear program
(ILP) on ten variables. This ILP is solved in O(1) time.

Any algorithm for solving the PMS problem when q < n is
typically named with a prefix of 'q'. One of the first algorithms to
address this version of the PMS problem was qPMSPrune [12].
Algorithm qPMSPrune is based on the following observation: If M
is any (l,d,q)-motif of the input strings s_1, ..., s_n, then there exists
an i (with 1 ≤ i ≤ n - q + 1) and an l-mer x ∈ s_i such that M is in
B_d(x) and M is an (l,d,q-1)-motif of the input strings excluding
s_i. The algorithm runs through every possible value of i, 1 ≤ i ≤ n.
For a given value of i, it considers every l-mer x of s_i. Specifically,
it constructs B_d(x) and identifies elements of B_d(x) that are
(l,d,q-1)-motifs (with respect to input strings other than s_i).
B_d(x) is represented as a tree with x as the root. This tree is
traversed in a depth-first manner and some pruning conditions are
used to prune subtrees that do not have any motifs.

Algorithm qPMS7 of [10] extends the observation of
qPMSPrune as follows: If M is any (l,d,q)-motif of the input
strings s_1, ..., s_n, then there exist 1 ≤ i ≠ j ≤ n and l-mer x ∈ s_i and
l-mer y ∈ s_j such that M is in B_d(x) ∩ B_d(y) and M is an
(l,d,q-2)-motif of the input strings excluding s_i and s_j. qPMS7
considers every possible pair (i,j), 1 ≤ i, j ≤ n and i ≠ j. For a given
pair (i,j), every possible pair of l-mers (x,y) is considered (where x
is from s_i and y is from s_j). For a given x and y, the algorithm finds
all the elements of B_d(x) ∩ B_d(y) that are (l,d,q-2)-motifs (with
respect to input strings other than s_i and s_j). B_d(x) ∩ B_d(y) is
explored by traversing an acyclic graph, denoted as Q_d(x,y).
Q_d(x,y) is traversed in a depth-first manner. Here again effective
pruning conditions are used to prune subgraphs of Q_d(x,y).

For more details about the PMS algorithms, the readers are 
referred to the respective papers. 

An Experimental Validation of PMS Algorithms

Planted motif search is just one computational model for motifs.
An important question is how effective this model is in identifying
motifs from real biological data. In fact, the same question is
relevant for any (computational or other) motif model. In [14],
Tompa et al. evaluated the performance of 13 different
motif finding programs: AlignACE, ANN-Spec, Consensus,
GLAM, The Improbizer, MEME, MITRA, MotifSampler,
Oligo/dyad-analysis, QuickScore, SeSiMCMC, Weeder and
YMF. These programs were evaluated on several biological
datasets (for which the motifs were known via experimental
techniques) based on many different performance measures. Two
of the performance measures employed were sensitivity and
specificity. Sensitivity represents the fraction of sites that were
correctly predicted and specificity represents the fraction of
non-sites that were correctly predicted.
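
As a rough illustration of these two measures, the sketch below computes them from sets of sequence positions. It assumes a position-level reading of the definitions above (each position is either part of a known site or not); the exact counting conventions used in [14] and [7] may differ, and the function name and arguments are hypothetical.

```python
def sensitivity_specificity(true_sites, predicted_sites, total_positions):
    """Position-level sensitivity and specificity.

    true_sites and predicted_sites are collections of position indices;
    total_positions is the number of positions in the dataset.
    """
    true_sites, predicted_sites = set(true_sites), set(predicted_sites)
    tp = len(true_sites & predicted_sites)   # sites predicted as sites
    fn = len(true_sites - predicted_sites)   # sites that were missed
    fp = len(predicted_sites - true_sites)   # non-sites predicted as sites
    tn = total_positions - tp - fn - fp      # non-sites left unpredicted
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity
```

For example, with known site positions {1, 2, 3, 4}, predicted positions {3, 4, 5} and 10 positions in total, the sensitivity is 2/4 and the specificity is 5/6.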

In [7], Sharma et al. evaluated the performance of PMS
algorithms. In particular, they employed the same 56 datasets
that were used by Tompa et al. [14]. As a result, Sharma et al.
were able to compare the PMS algorithms with the thirteen programs
evaluated in [14]. Several versions of the PMS algorithms were
tested. One of these versions, namely PMS SumMinD, yields
an average sensitivity of 28.8% and a specificity of 91.63% on all
the 56 datasets. In comparison, the best of the 13 algorithms tested
by Tompa et al. [14], ANN-Spec, has an average sensitivity of
8.7% and a specificity of 98.22%.

Our Motif Search Framework 

In addition to the PMS algorithms, we deploy a motif search
framework that uses the PMS algorithms as underlying routines.
The motif search framework basically works as follows. The user
inputs a set of sequences that contain motifs of interest. The
framework runs a PMS algorithm (qPMS7 as of now) with
different triples of the parameters (l,d,q) and collects all of the
output motifs. These motifs are called candidate motifs. Then, it
uses a score function that ranks the candidate motifs. The score
function measures the significance of a candidate motif based on
the probability that it occurs by random chance. Finally, the tool
outputs the top 100 motifs with the highest scores. The score of a
candidate motif will be high if the probability that it occurs by
random chance is low.

Since the run time of PMS algorithms depends exponentially
on the parameter d, i.e., the maximum number of
mutations allowed, we let the user set this parameter indirectly
through the computational preferences "Quick Search" or "Full
Search". If the "Quick Search" option is chosen, then the
parameter d is set to a low value (3, specifically). Conversely, if the
"Full Search" option is chosen, then the parameter d is set to a
higher value (7, specifically).

Identifying Motif Instances in the Input Sequences 

Once a motif is found, its instances in the input sequences will
be located as follows. For each input sequence, the location of the
motif instance in the input sequence is the place where the motif
matches best, i.e., with the fewest mismatches. This location can be
found easily by scanning through the entire input sequence.
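
The scan described above amounts to sliding the motif along the sequence and keeping the window with the fewest mismatches. The sketch below illustrates this; the function name is hypothetical, and breaking ties in favor of the leftmost position is an assumption, since the text does not say how ties are resolved.

```python
def best_instance(motif, sequence):
    """Return (position, mismatches) of the best match of motif in sequence."""
    l = len(motif)
    if len(sequence) < l:
        raise ValueError("sequence is shorter than the motif")
    best_pos, best_dist = 0, l + 1
    for pos in range(len(sequence) - l + 1):
        window = sequence[pos:pos + l]
        dist = sum(a != b for a, b in zip(motif, window))
        # Strict comparison keeps the leftmost window on ties.
        if dist < best_dist:
            best_pos, best_dist = pos, dist
    return best_pos, best_dist
```

For the motif `ACGT` and the sequence `TTACGATT`, the best window is `ACGA` at position 2 (0-based) with one mismatch.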

Techniques to Identify Dyad Motifs 

Eskin and Pevzner have presented an algorithm for finding
dyad motifs [15]. This algorithm works as follows. Let the input
sequences be s_1, s_2, ..., s_n and let the length of each sequence be
m. A dyad is characterized by the parameters (l_1, g_1, g_2, l_2, d, k).
Here l_1 is the length of the first segment, l_2 is the length of the
second segment, the length of the gap between the two segments
can be in the range [g_1, g_2], and the dyad occurs in at least k out of
the n sequences with a Hamming distance of at most d. For each
input sequence s_i, the algorithm generates all the relevant l-mers
(where l = l_1 + l_2). Any such l-mer will be such that its prefix of
length l_1 will be an l_1-mer in some input sequence s_i, its suffix
of length l_2 will be an l_2-mer in the same sequence s_i, the prefix
occurs to the left of the suffix, and the length of the gap between
the prefix and the suffix is in the range [g_1, g_2]. Note that there are
O(mn(g_2 - g_1)) such l-mers. Let C be this collection of l-mers.
After having generated these l-mers, they use the mismatch tree
data structure to identify the l-mers that correspond to valid
dyads. In particular, any l-mer will be output as a dyad if there is a
d-neighbor of this l-mer that occurs in at least k of the input
sequences.

We speed up the above algorithm by exploiting the PMS1
algorithm. The improvement works as follows. We generate the
l-mers for each sequence as in the algorithm of [15]. There are
O(m(g_2 - g_1)) l-mers for each sequence. Let C_i be the collection
of l-mers from sequence s_i, for 1 ≤ i ≤ n. For each l-mer of C_i,
generate its d-neighborhood (i.e., l-mers that are within a
Hamming distance of d from the l-mer), for 1 ≤ i ≤ n. Let C'_i
be the collection of d-neighbors of all the l-mers generated from s_i, for 1 ≤ i ≤ n.
We can output d-neighbors that are in at least k of these
collections. One way of finding such l-mers is with the help of
hashing. Another way is to make use of integer sorting. For
example, we can sort each C'_i (for 1 ≤ i ≤ n), merge these sorted
lists, and go through the merged list to count the number of
sequences each such d-neighbor occurs in.
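
The hashing variant of this procedure can be sketched as follows. It is an illustrative brute-force version for small parameters, not the tool's implementation: a Python dictionary plays the role of the hash table, the neighborhoods are enumerated explicitly, and all function names and the DNA alphabet are assumptions.

```python
from collections import defaultdict
from itertools import combinations, product


def dyad_lmers(seq, l1, l2, g1, g2):
    """Join each l1-mer with an l2-mer lying g1..g2 positions to its right."""
    lmers = set()
    for start in range(len(seq) - l1 + 1):
        for gap in range(g1, g2 + 1):
            second = start + l1 + gap
            if second + l2 > len(seq):
                break  # larger gaps would run past the end as well
            lmers.add(seq[start:start + l1] + seq[second:second + l2])
    return lmers


def neighborhood(lmer, d, alphabet="ACGT"):
    """All strings within Hamming distance d of lmer."""
    result = set()
    for k in range(d + 1):
        for positions in combinations(range(len(lmer)), k):
            for letters in product(alphabet, repeat=k):
                candidate = list(lmer)
                for pos, ch in zip(positions, letters):
                    candidate[pos] = ch
                result.add("".join(candidate))
    return result


def dyad_motifs(sequences, l1, l2, g1, g2, d, k, alphabet="ACGT"):
    """d-neighbors that are supported by at least k input sequences."""
    support = defaultdict(set)  # d-neighbor -> indices of supporting sequences
    for i, seq in enumerate(sequences):
        for lmer in dyad_lmers(seq, l1, l2, g1, g2):
            for neighbor in neighborhood(lmer, d, alphabet):
                support[neighbor].add(i)
    return {m for m, seq_ids in support.items() if len(seq_ids) >= k}
```

Each returned string is the concatenation of the two boxes of a candidate dyad. Recording sequence indices in a set (rather than a counter) ensures a d-neighbor is counted once per sequence even if several l-mers of that sequence produce it.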

Availability and Requirements 

Project name: PMS - Panoptic Motif Search Tool. Project home
page: http://pms.engr.uconn.edu or http://motifsearch.com.
Licence: PMS tools will be readily available to any scientist
wishing to use them for non-commercial purposes, without restrictions.
The online tool is freely available without login.